(57) Abstract

Methods for assigning a quantitative score to the relatedness of aligned 
polymorphic biopolymer sequences such that small differences between otherwise 
identical sequences are highlighted are disclosed, including computer systems and 
program storage devices for carrying out the methods on a computer. Specifically, 
the methods of the invention comprise the steps of providing a test sequence and 
a basis set of sequences such that the test sequence and a basis set of sequences 
are aligned; determining the identity of a monomer unit at a position m in the
test sequence; assigning a value of 1 to a local matching probability x_m if the
monomer unit at position m in the test sequence matches any members of the basis
set at position m, or, assigning a value of between 0 and 1 to a local matching
probability x_m if the monomer unit at position m in the test sequence does not match
any members of the basis set at position m. In a preferred embodiment, the above
method is performed at a plurality of sequence locations and the local matching 
probabilities are multiplied together to provide a global matching probability. 
ALIGNMENT-BASED SIMILARITY SCORING METHODS FOR QUANTIFYING
THE DIFFERENCES BETWEEN RELATED BIOPOLYMER SEQUENCES 

FIELD OF THE INVENTION 

This invention relates to methods for quantitatively 
determining the relatedness of biopolymer sequences. More 
specifically, the invention is directed to methods for scoring 
aligned polymorphic biopolymer sequences such that small 
differences between otherwise identical sequences are
highlighted, including computer systems and program storage 
devices for carrying out such methods using a computer. 

BACKGROUND 

The identification of sequence homology between an unknown
biopolymer test sample and a known gene or protein often
provides the first clues about the function and/or the
three-dimensional structure of a protein, or the evolutionary
relatedness of genes or proteins. Because of the recent
explosion in the amount of DNA sequence information available
in public and private databases as a result of the human genome
project and other large scale DNA sequencing efforts, the
ability to screen newly discovered DNA sequences against
databases of known genes and proteins has become a particularly
important aspect of modern biology.

Generally, the sequence comparison problem may be divided
into two parts: (1) alignment of the sequences and (2) scoring
the aligned sequences. Alignment refers to the process of
introducing "phase shifts" and "gaps" into one or both of the
sequences being compared in order to maximize the similarity
between two sequences, and scoring refers to the process of
quantitatively expressing the relatedness of the aligned
sequences.

Existing sequence comparison processes may be divided into
two main classes: global comparison methods and local
comparison methods. In global comparison methods, the entire
pair of sequences is aligned and scored in a single operation
(Needleman and Wunsch), and in local comparison methods, only
highly similar segments of the two sequences are aligned and
scored and a composite score is computed by combining the
individual segment scores, e.g., the FASTA method (Pearson and
Lipman), the BLAST method (Altschul) and the BLAZE method
(Brutlag).

Application of existing alignment-based similarity scoring
methods is problematic in applications where a high degree of
sensitivity is required, i.e., where very similar sequences are
being compared, e.g., two 1500-base 16S rDNA sequences
differing by only 1-5 bases. An alignment-based similarity
score, especially one based on local alignments such as FASTA
(Pearson and Lipman) or BLAST (Altschul), will tend to
emphasize the similarity of sequences and overlook small
differences between them. In applications where small
differences are critical, e.g., distinguishing the 16S RNA
sequences of E. coli K-12 (benign) and E. coli O157:H7
(pathogenic), it is crucial to be able to detect small
differences between sequences rather than similarities.

An additional shortcoming of existing similarity scoring
methods is that they fail to take into account the polymorphic
nature of the sequences being compared, i.e., the fact that
more than one monomer unit may be present in a given sequence
at a given position, and that the proportion of each monomer at
that position may be variable such that a minor component may
go undetected. Such polymorphisms can arise when the sequencing
template is a polymorphic multicopy gene which has been
amplified by the PCR. For example, consider a set of sequences
which are polymorphic at a position m, e.g., sequences derived
from a sample including 10 copies of a polymorphic gene.
Furthermore, assume that the polymorphism is such that in 8 of
the copies of the gene the nucleotide at position m is an A and
in the remaining two copies of the gene the nucleotide is a C.
Thus, in an ideal sequencing experiment, each of the members of
the set would show a signal having an 80% A component and a 20%
C component at position m. However, in reality, many automated
sequencing methods do not have the capability to reliably
detect the presence of a 20% minor component. In such a case,
the basis set would show only an A nucleotide at position m
while the true situation would be that 20% of the polymorphic
genes have a C at that position. Using existing similarity
scoring methods, position m would be deemed to be a non-match,
i.e., existing methods would erroneously conclude that a test
sequence that included a C at position m was not a member of
the set of known sequences.

Thus, what is needed is an alignment-based similarity
scoring method (i) capable of quantitatively distinguishing
very similar sequences and (ii) capable of taking into account
the polymorphic nature of many biopolymer sequences in light of
the inability of current sequencing technology to reliably
detect a polymorphic nucleotide present as a minor component.

SUMMARY 

The present invention is directed towards an alignment-based
similarity scoring method for quantifying differences
between closely related polymorphic biopolymer sequences, e.g.,
DNA, RNA, or protein sequences.

It is an object of the invention to provide an alignment-based
similarity scoring method which is capable of
meaningfully distinguishing sequences having a sequence
homology of greater than 99%.

It is another object of the invention to provide an 
alignment-based similarity scoring method which is capable of 
distinguishing polymorphic sequences in a statistically 
meaningful way. 

In a first aspect, the foregoing and other objects of the
invention are achieved by a method comprising the steps of
providing a test sequence and a basis set of sequences where
the test sequence and the basis set of sequences are aligned;
determining the identity of a monomer unit at a position m in
the test sequence; and assigning a value of 1 to a local
matching probability x_m if the monomer unit at position m in
the test sequence matches any members of the basis set at
position m, or, assigning a value of between 0 and 1 to a local
matching probability x_m if the monomer unit at position m in the
test sequence does not match any members of the basis set at
position m. Preferably, if the monomer unit at position m in
the test sequence does not match any members of the basis set
at position m, x_m is assigned a value of

    x_m = (1 - p)^n

where p is a number between 0 and 1 and n is the number of
sequences in the basis set at position m. Preferably, p is
between 0.4 and 0.6, and more preferably p is 0.5. In a second
preferred embodiment, the step of determining the identity of
a monomer unit at a position m in the test sequence and the
step of assigning a value to the local matching probability x_m
are performed at a plurality of positions m in the test
sequence such that a plurality of local matching probabilities
x_m are determined; and a global matching probability for the
basis set and the test sequence, X_G, is computed by forming a
product of the plurality of x_m. Preferably, the local matching



WO 98/20433 



PCT/US97/19491 



probabilities are determined for each position m in the test
sequence and the global matching probability for the basis set
and the test sequence is determined by computing the product

    X_G = \prod_{m=1}^{M} x_m

In yet another preferred embodiment, the above-described
method is performed on each of a plurality of test sequences,
and a statistical measure of a combined value of the local or
global matching probabilities is determined, e.g., an average
value, a standard deviation, a maximum value, or a minimum
value.

In a further preferred embodiment of the method of the
invention, the above-described method is performed using a
plurality of values of p and an optimum value of X_G is
determined.

In a second aspect, the invention comprises a program 
storage device readable by a machine, tangibly embodying a 
program of instructions executable by a machine to perform the
above-described method steps to quantify differences between 
closely related aligned biopolymer sequences. 

In a third aspect, the invention includes a computer
system for determining a similarity score for a test sequence
and a basis set of sequences comprising an input device for
inputting a test sequence and a basis set of sequences such
that the test sequence and the basis set of sequences are
aligned; a memory for storing the test sequence and basis set;
a processing unit configured for determining the identity of a
monomer unit at a position m in the test sequence; and
assigning a value of 1 to a local matching probability x_m if
the monomer unit at position m in the test sequence matches any
members of the basis set at position m, or, assigning a value
of between 0 and 1 to a local matching probability x_m if the
monomer unit at position m in the test sequence does not match
any members of the basis set at position m.

These and other objects, features, and advantages of the 
present invention will become better understood with reference
to the following description, drawings, and appended claims. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIGS. 1, 2, and 3 are flow charts depicting various 
preferred similarity scoring methods of the invention.

FIG. 4 is a schematic diagram of a preferred computer 
system of the invention. 

FIG. 5 shows an alignment of an exemplary basis set and
test sequence.

FIG. 6 shows the two basis sets and a test sequence to 
be compared by both the method of the present invention and 
the FASTA method.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Reference will now be made in detail to the preferred
embodiments of the invention, examples of which are illustrated
in the accompanying drawings. While the invention will be
described in conjunction with the preferred embodiments, it
will be understood that they are not intended to limit the
invention to those embodiments. On the contrary, the invention
is intended to cover alternatives, modifications, and
equivalents, which may be included within the invention as
defined by the appended claims. For the sake of clarity, the
method and apparatus will be described primarily with respect
to polynucleotide sequences; however, it will be apparent to one
of ordinary skill in the art that the concepts discussed are
applicable to any experimentally derived collection of
biopolymer sequences.

I. DEFINITIONS

Unless stated otherwise, the following terms and phrases 
as used herein are intended to have the following meanings: 

The term "monomer unit" refers to an individual unit
making up a biopolymer sequence, e.g., a particular amino acid
in a protein or a particular nucleotide in a polynucleotide.
In the case of a polynucleotide sequence, the monomer may be a
combination of nucleotides, nomenclature of such combinations
being defined by the IUB code (Nomenclature Committee).

The term "polymorphism" refers to a location in a sequence
at which more than one monomer unit resides, e.g., an A
nucleotide and a G nucleotide. Such polymorphisms may arise
when the sequencing template is made up of multiple
polynucleotides having different nucleotides at a particular
position.

The term "test sequence" refers to a biopolymer sequence 
to be compared to a basis set of biopolymer sequences. 

The term "basis set" refers to a collection of biopolymer 
sequences to be compared to a test sequence.

The term "minor component" refers to a monomer unit at a 
polymorphic position which has the smaller of any two signals 
at that position. The term "major component" refers to a 
monomer unit at a polymorphic position which has the larger of
any two signals at that position. 

A "match" occurs when a monomer unit at a position m in
a test sequence is present at the position m of any one of the
members of a basis set of sequences. In the case of
polynucleotide sequences, either one of two types of matches
may be employed in the methods of the invention depending upon
how monomer units represented by IUB codes are treated. In a
first type of match, referred to as an "exact match", the
monomer unit of the test sequence and the members of the basis
set must match exactly, including monomer units represented by
IUB codes. Thus, if a test sequence contained a "W" (A and T)
at position m, a basis set containing only a T at that position
would not be considered a match. Alternatively, in a second
type of match, referred to as an "IUB match", a match with
either of the members of the IUB pair would be scored as a
match. Thus, if a test sequence contained a "W" (A and T) at
position m, a basis set containing only a T at that position
would be considered a match. Either type of match may be
applied to the methods of the present invention.
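
The distinction between the two match types can be illustrated with a short sketch. The function names `exact_match` and `iub_match` and the table name `IUB_CODES` are hypothetical and do not appear in this disclosure; the table holds the standard IUB nucleotide ambiguity codes. Expanding IUB codes found in the basis set in the same way as those in the test sequence is an assumption extrapolated from the "W" example above.

```python
# Standard IUB nucleotide ambiguity codes: each symbol -> the bases it denotes.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def exact_match(test_unit, basis_column):
    """Exact match: the test symbol itself must appear in the basis column."""
    return test_unit in basis_column


def iub_match(test_unit, basis_column):
    """IUB match: any base denoted by the test symbol appears in the column.

    Assumption: IUB codes in the basis column are expanded the same way.
    """
    test_bases = set(IUB_CODES[test_unit])
    return any(test_bases & set(IUB_CODES[unit]) for unit in basis_column)


# A "W" (A and T) in the test sequence against a basis set showing only T.
print(exact_match("W", ["T"]))  # False
print(iub_match("W", ["T"]))    # True
```

Which of the two functions is used changes only the match test; the scoring steps that follow are the same in either case.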

II. SCORING METHOD

The similarity scoring method of the present invention is
directed to a method for scoring aligned biopolymer sequences
such that small differences between otherwise identical
sequences are highlighted and such that the polymorphic
character of the sequences is accounted for in a quantitative,
statistically meaningful way. Generally, the method of the
invention includes the following steps. A test sequence and a
basis set of sequences are provided wherein the test sequence
and the basis set of sequences are aligned. The identity of a
monomer unit at a position m in the test sequence is
determined. A local matching probability x_m is determined
where a value of 1 is assigned to the local matching
probability if the monomer unit at position m in the test
sequence matches any of the members of the basis set at
position m. Alternatively, a value of between 0 and 1 is
assigned to the local matching probability x_m if the monomer
unit at position m in the test sequence does not match any of
the members of the basis set at position m.



A. Test Sequence and Basis Set of Sequences 
A1. Test Sequence: A test sequence according to the
similarity scoring method of the invention may be any
biopolymer sequence of interest, e.g., protein, nucleic acid,
PNA, and the like. Preferably, the test sequence is a protein
or nucleic acid sequence. More preferably, the test sequence
is a nucleic acid sequence. According to the nomenclature used
herein, the test sequence is described as an M-element linear
array of monomer units located at positions m equal to 1
through M.

The test sequence may be derived from any biological
organism or remains thereof. For example, the test sequence
may be a gene coding for a 16S RNA molecule of a medically
important microorganism. In one preferred alternative, the
test sequence is a consensus sequence derived from a collection
of biopolymer sequences. In an alternative preferred
embodiment, the test sequence is derived from an assembly of
partially overlapping sequences.

A2. Basis Set of Sequences: A basis set according to the
invention comprises a set of biopolymer sequences derived from
a plurality of related basis templates located in a biological
sample. The basis set may be composed of sequences which are
derived from homomorphic polynucleotide templates, e.g.,
templates derived from a single copy cloned gene. In such a
case, any polymorphism seen in the sequence of a member of the
basis set is due only to an erroneous base call caused by the
inherent variability of the sequencing process, e.g.,
variability due to enzymatic misincorporation of
dideoxynucleotide triphosphate terminators, incomplete
resolution of neighboring species in a sequencing gel resulting
in signal overlap, finite detection limits of labels,
uncertainties associated with the particular base-calling
algorithm used, contamination of samples, and the like.

Alternatively, the basis set may be composed of sequences
which are derived from polymorphic polynucleotide templates,
e.g., templates derived from PCR amplification of a multicopy
gene wherein the multiple copies have different sequences.
Here, the variability among members of the basis set is due to
both the inherent variability of the sequencing process and the
true sequence differences among the templates used to derive
the basis set.

The basis set can be conveniently described as an N×M
matrix where N is the total number of sequences in the basis
set and M is the number of monomer units making up the test
sequence.

B. Alignment of Test Sequence and Basis Set 

As described in the Background section of this disclosure,
alignment refers to the process of introducing "phase shifts"
and "gaps" into sequences being compared in order to maximize
the similarity between the two sequences. Any method for
sequence alignment may be used with the similarity scoring
methods of the present invention. Exemplary alignment methods
include CLUSTAL (Higgins) and Needleman-Wunsch (Needleman).

C. Scoring Relatedness of a Test Sequence and a Basis Set 
C1. Scoring of individual monomer units: To assign a
quantitative similarity score to the relatedness of a monomer
unit at a given location m in a test sequence and the set of
monomer units at the same location m in a basis set of
sequences, a value for a local matching probability x_m is
assigned to the position m, where the local matching
probability is the probability that a monomer unit at a
position m in the test sequence is a member of the set of
monomer units at position m in the sequences making up the
basis set, and 1 - x_m is the probability that a monomer unit at
position m in the test sequence is not a member of the set of
monomer units at position m in the sequences making up the
basis set. The method is generally described in the flow chart
of FIG. 1.

In the similarity scoring method of the invention, if the
monomer unit at position m in the test sequence matches any of
the members of the basis set at position m, x_m is assigned a
value of 1. Thus for example, if the test sequence is

    ACCGT

and the basis set is

    ACAGG
    ACGGA
    ACAGT

the value of x_5 would be 1 because of the presence of a T at
position 5 of the third member of the basis set.
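
The match test in this example can be sketched in a few lines. The function name `matches_at` is hypothetical, positions are 1-based to follow the notation of the text, and only exact matches are counted.

```python
def matches_at(test_seq, basis_set, m):
    """True if the unit at 1-based position m of test_seq occurs at
    position m of at least one member of the basis set (exact match)."""
    return any(seq[m - 1] == test_seq[m - 1] for seq in basis_set)


test_seq = "ACCGT"
basis_set = ["ACAGG", "ACGGA", "ACAGT"]

flags = [matches_at(test_seq, basis_set, m) for m in range(1, len(test_seq) + 1)]
print(flags)  # [True, True, False, True, True]
```

Position 5 matches through the third basis sequence, so x_5 = 1, while position 3 (C against A, G, A) is the only position that does not match.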

Alternatively, if the monomer unit at position m in the
test sequence does not match any of the members of the basis
set at position m, the local matching probability x_m is
assigned a value of between 0 and 1. Conceptually, x_m
corresponds to a maximum probability that a monomer unit is in
fact present at position m in at least one of the basis
templates used to generate the basis set yet is not represented
in the basis set itself because of the inability of the
sequencing method used to generate the basis set to detect the
monomer unit. Thus, even if the monomer unit is not
represented at position m in any of the members of the basis
set, the method of the invention assigns a finite probability
that such monomer unit is in fact present in the population of
basis templates used to generate the basis set, but is present
at levels below that which the sequencing method employed to
generate the basis set is able to detect.

A preferred method for determining the value of x_m when
the monomer unit at position m does not match the members of
the basis set of N sequences is according to the relation

    x_m = (1 - p)^n

where p is a number between 0 and 1 and n is the number of
sequences in the basis set having an element at position m.
Note that when the sequences of the basis set overlap at every
position m, then n=N for each position m. However, when some
members of the basis set do not overlap other members at
certain positions, then n<N at those nonoverlapping positions
m in the sequence.
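
A minimal sketch of this relation, including the distinction between n and N, is given below. It assumes the form x_m = (1 - p)^n, which reproduces the numerical values quoted in this disclosure. The function name `local_matching_probability` and the use of "-" to mark a position that a basis sequence does not cover are hypothetical conventions chosen for illustration.

```python
GAP = "-"  # hypothetical marker: this basis sequence has no element here


def local_matching_probability(test_seq, basis_set, m, p=0.5):
    """Local matching probability x_m at 1-based position m."""
    unit = test_seq[m - 1]
    # Only basis sequences that cover position m contribute to n.
    column = [seq[m - 1] for seq in basis_set if seq[m - 1] != GAP]
    if unit in column:
        return 1.0
    n = len(column)
    return (1.0 - p) ** n


full = ["ACAGG", "ACGGA", "ACAGT"]     # n = N = 3 at every position
partial = ["ACAGG", "AC-GA", "ACAGT"]  # n = 2 at position 3

print(local_matching_probability("ACCGT", full, 3))     # 0.125
print(local_matching_probability("ACCGT", partial, 3))  # 0.25
```

With fewer basis sequences covering a position, a mismatch there is penalized less, since there is less evidence that the monomer unit is truly absent.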

Conceptually, the value of p is a measure of the
sensitivity of the sequencing system used to generate the
sequences making up the basis set, i.e., the ability of a
sequencing system to detect minor components in a signal
including both major and minor components. Sensitivity is
determined by such factors as the detectability of the labels
used to label the sequencing fragments, the ability of the
analysis software to distinguish overlapping peaks in an
electropherogram, and the like. A large value of p indicates
that the sequencing system is highly sensitive while a small
value of p indicates that the sequencing system has poor
sensitivity and would miss all but the largest minor
components. For example, consider a basis set composed of 5
sequences overlapping at position m, i.e., n=5. The values of x_m
for three different values of p when there are no matches
between the basis set and the test sequence according to the
relation provided above are

    p = 0.9    x_m = 0.001%
    p = 0.5    x_m = 3.1%
    p = 0.1    x_m = 59.0%

Thus, when p=0.9, i.e., a sequencing system having good
sensitivity, the calculated probability that a monomer unit at
position m of the test sequence is not present in the basis set
but is present as a minor component of the basis templates used
to derive the basis set is very small, i.e., 0.001%.
Conversely, when p=0.1, i.e., a sequencing system having poor
sensitivity, the calculated probability that a monomer unit at
position m of the test sequence does not match the basis set
but is present as a minor component of the basis templates used
to derive the basis set is significant, i.e., 59.0%.
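
The three values above can be reproduced with a short calculation, again assuming the relation x_m = (1 - p)^n with n=5. The function name `no_match_probability` is hypothetical.

```python
def no_match_probability(p, n):
    """x_m for a position with no match, under x_m = (1 - p)^n."""
    return (1.0 - p) ** n


for p in (0.9, 0.5, 0.1):
    print(f"p={p}: x_m = {100 * no_match_probability(p, 5):.3f}%")
# p=0.9: x_m = 0.001%
# p=0.5: x_m = 3.125%
# p=0.1: x_m = 59.049%
```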

A practical consequence of choosing a large or small value 
of p relates to the likelihood of false positive results vs. 
false negative results, a false negative result being a result 
indicating a test sequence is not a member of the basis set 
when in fact it is a member of the basis set, and a false 
positive result indicating a test sequence is a member of the 
basis set when in fact it is not a member of the basis set. If 
a large value of p is chosen, e.g., greater than 0.6, the 
likelihood of a false negative result is increased, while if a 
small value of p is chosen, e.g., less than 0.4, the likelihood 
of a false positive result is increased. Preferably, to 
balance the effects of false positive and false negative 
results, p is chosen to be between 0.4 and 0.6. More preferably,
p is chosen to be approximately 0.5. 

C2. Scoring of multiple monomer units: To assign a
similarity score to the relatedness of a test sequence and a
basis set of sequences based on a plurality of monomer units
located at a plurality of positions m in the sequences, a value
for the local matching probability x_m is determined for each of
a plurality of monomer units located at a plurality of
positions m in the sequences. Then, a global matching
probability X_G is computed by forming the product of the
individual matching probabilities. The method is generally
described in the flow chart of FIG. 2. In a preferred
embodiment, the value of x_m is determined for each position of
the test sequence and the product of all of the values of x_m is
computed according to the relation

    X_G = \prod_{m=1}^{M} x_m

This preferred embodiment is generally described in the flow
chart of FIG. 3.
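
The product over all positions can be sketched as follows. The sketch assumes the relation x_m = (1 - p)^n for non-matching positions, exact matching, and basis sequences that all overlap the full test sequence (n = N). The function names are hypothetical.

```python
import math


def local_matching_probability(test_unit, basis_column, p=0.5):
    """x_m for one position: 1 on a match, (1 - p)^n otherwise."""
    if test_unit in basis_column:
        return 1.0
    return (1.0 - p) ** len(basis_column)


def global_matching_probability(test_seq, basis_set, p=0.5):
    """X_G: product of x_m over every position m of the test sequence."""
    columns = zip(*basis_set)  # one tuple of basis units per position m
    return math.prod(
        local_matching_probability(unit, column, p)
        for unit, column in zip(test_seq, columns)
    )


print(global_matching_probability("ACCGT", ["ACAGG", "ACGGA", "ACAGT"]))  # 0.125
```

Because matching positions contribute a factor of exactly 1, the score depends only on the mismatching positions, which is why a handful of differences in an otherwise identical sequence dominates the result.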

C3. Scoring based on multiple test sequences: In an
alternative embodiment of the scoring method discussed above,
rather than comparing a single test sequence with a basis set
of sequences, a set of test sequences is compared with the
basis set. In this embodiment, a local or global matching
probability is determined for each member of the set of test
sequences individually according to the methods described
above. Then, any measure of the combined local or global
matching probability for the set of test sequences may be
determined, e.g., an average value of the matching probability
including standard deviation, maximum values, minimum values,
log X_G, or any other like statistical measures.
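
One way to combine the scores of several test sequences is sketched below. The function names `global_matching_probability` and `summarize` are hypothetical, the scoring assumes x_m = (1 - p)^n with fully overlapping basis sequences, and the choice of a base-10 logarithm is an assumption, since the text does not specify the base.

```python
import math
import statistics


def global_matching_probability(test_seq, basis_set, p=0.5):
    """X_G under x_m = (1 - p)^n, exact matching, full overlap."""
    score = 1.0
    for unit, column in zip(test_seq, zip(*basis_set)):
        if unit not in column:
            score *= (1.0 - p) ** len(column)
    return score


def summarize(test_seqs, basis_set, p=0.5):
    """Summary statistics of X_G over a set of test sequences."""
    scores = [global_matching_probability(t, basis_set, p) for t in test_seqs]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "max": max(scores),
        "min": min(scores),
        # Requires p < 1 so that every score is strictly positive.
        "log10": [math.log10(s) for s in scores],
    }


basis = ["ACAGG", "ACGGA", "ACAGT"]
summary = summarize(["ACAGT", "ACCGT", "TCCGT"], basis)
print(summary["max"], summary["min"])  # 1.0 0.015625
```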

C4. Scoring based on variable value of the parameter p:
In yet another alternative embodiment of the similarity scoring
methods of the invention, rather than fixing the value of the
parameter p at a constant value in the calculation of a
matching probability, p is varied over a range of values. In
this method, for a fixed value of p, a local or global matching
probability is determined for an individual test sequence or
set of test sequences as described above. Then, the value of
p is changed, and the calculation of matching probabilities is
repeated using the new value of p. This process is then
repeated for a plurality of different values of p. Then, an
optimum value or range of values of the matching probability is
determined. This method using a variable value of p is
particularly preferred when the test sequence is made up of a
set of multiple test sequences as described in Section C3
above.
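
The repetition over several values of p can be sketched as a simple sweep. The text does not state the criterion by which an optimum is chosen, so the sketch below only tabulates X_G for each p and leaves the selection to the user. The function names are hypothetical and the scoring assumes x_m = (1 - p)^n with fully overlapping basis sequences.

```python
def global_matching_probability(test_seq, basis_set, p):
    """X_G under x_m = (1 - p)^n, exact matching, full overlap."""
    score = 1.0
    for unit, column in zip(test_seq, zip(*basis_set)):
        if unit not in column:
            score *= (1.0 - p) ** len(column)
    return score


def sweep_p(test_seq, basis_set, p_values):
    """Map each value of p to the resulting global matching probability."""
    return {p: global_matching_probability(test_seq, basis_set, p) for p in p_values}


scores = sweep_p("ACCGT", ["ACAGG", "ACGGA", "ACAGT"], [0.4, 0.5, 0.6])
for p, score in scores.items():
    print(f"p={p}: X_G = {score:.3f}")
```

For a fixed set of mismatches the score falls monotonically as p rises, so any selection rule applied to the sweep needs an external criterion, such as separation between known member and non-member test sequences.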



III. COMPUTER SYSTEM AND PROGRAM STORAGE DEVICE 

The steps of the above-described scoring method are preferably
performed by a computer. In one preferred embodiment, the
computer is made up of a processing unit, memory, I/O device,
and associated address/data bus structures for communicating
information therebetween. See FIG. 4. The microprocessor can
take the form of a generic microprocessor driven by
appropriate software, including RISC and CISC processors, a
dedicated microprocessor using embedded firmware, or a
customized digital signal processing circuit (DSP) which is
dedicated to the specific processing tasks of the method. The
memory may be within the microprocessor, i.e., level 1 cache,
fast S-RAM, i.e., level 2 cache, D-RAM, or disk, either optical
or magnetic. The I/O device may be any device capable of
transmitting information between the computer and the user,
e.g., a keyboard, mouse, network card, and the like. The
address/data bus may be PCI bus, NuBus, ISA, or any other like
bus structure.

When the method is performed by a computer, the above-described
method steps are embodied in a program storage device
readable by a machine, such program storage device including a
computer readable medium. Computer readable media include
magnetic diskettes, magnetic tapes, optical disks, Read Only
Memory, Direct Access Storage Devices, gate arrays,
electrostatic memory, and any other like medium.

IV. EXAMPLES 

The invention will be further clarified by a consideration 
of the following examples, which are intended to be purely 
exemplary of the invention and not to in any way limit its 
scope.

EXAMPLE 1

Scoring the Similarity Between E. coli Strain B and E. coli
Strain O157:H7

FIG. 5 shows an alignment of a basis set comprising
multiple sequencing runs of E. coli Strain B (Sigma Chemical Co.
p/n D4889) and a test sequence comprising a strain of E. coli
O157:H7. The DNA sequences were obtained using the ABI PRISM™
Dye Terminator Cycle Sequencing Kit, AmpliTaq FS in combination
with the ABI PRISM™ Model 377 DNA Sequencer according to
manufacturer's instructions (PE Applied Biosystems, Division of
The Perkin-Elmer Corporation (PEABD), p/n 402080). The
sequences were aligned using the Sequence Navigator™ software
which employs the CLUSTAL multiple alignment method (PEABD p/n
401615).

As shown in FIG. 5, all 5 replicates of the Strain B basis
set show a base assignment of A at position m=7, while the
O157:H7 test sequence shows a G at that position.

The value of x_m at position m=7 of the O157:H7 test
sequence was determined where p=0.5 and n=5 resulting in a
value of (0.5)^5 = 3.13%. The same procedure was applied at
positions m=9 (W vs. T) and m=26 (Y vs. T). Based on only
three base differences, it was inferred that the O157:H7 test
sequence is not a member of the basis set of Strain B sequences
with a probability of greater than 99.99%, i.e., 1 - (3.13%)^3.
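
The arithmetic of this example can be written out step by step. The working below assumes the relation x_m = (1 - p)^n, that n=5 at each of the three mismatching positions, and that every other position matches and therefore contributes a factor of 1.

```latex
\begin{align*}
x_7 = x_9 = x_{26} &= (1 - 0.5)^5 = 0.03125 \approx 3.13\% \\
X_G &= x_7 \, x_9 \, x_{26} = (0.03125)^3 \approx 3.05 \times 10^{-5} \\
1 - X_G &\approx 0.99997 > 99.99\%
\end{align*}
```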

EXAMPLE 2 

Comparison of the Method of the Invention with the
FASTA Method for Scoring Related Sequences

In this example, a similarity score was calculated for a 
test sequence and each of two basis sets of sequences using 
both the method of the invention and the FASTA method.

FIG. 6 shows the two basis sets and the test sequence used
in this comparison. The first basis set, set 1, is composed of
sequences 6-8 in the figure. These sequences were obtained from
clinical isolates of E. coli strain O157. The second basis
set, set 2, is composed of sequences 1-4 in the figure. These
sequences were obtained from four replicate sequencing runs of
E. coli strain B. The test sequence, sequence 5 in the figure,
is a clinical isolate of E. coli strain O157. Thus, the test
sequence is a member of set 1 (in fact, the sequences are
identical) but is not a member of set 2. The arrows at
positions 106, 114, 121, 137, 149, 192, 208, and 220 in the
figure indicate positions at which the sequences of set 2 are
polymorphic with respect to each other. The arrows at
positions 202, 206, 219, 221, 222, 223 and 238 in the figure
indicate positions at which none of the sequences of set 2
match the test sequence. Note that in this experiment, only
exact matches were counted as a match.

Scoring the similarity of the test sequence with set 1 and 
set 2 using the FAST A method as implemented in the GeneAssist™ 
software package (PEABD p/n 402233), using a k-tuple of 2, 
resulted in similarity scores of 1996 and 1942, respectively. 
20 Even though the test sequence is a member of set 1 and is not 
a member of set 2, the similarity scores only differed by 
approximately 2.5%. Thus, the FAST A method was not able to 
clearly distinguish which of the two basis sets the test 
sequence was a member of. 

Scoring the similarity of the test sequence with set 1 and 
set 2 using the scoring method of the invention resulted in 
scores of essentially 100% and 0%, respectively, where p was 
set at 0.5 and n was set at 3 for comparison with set 1 and 4 
for comparison with set 2. Thus, the scoring method of the 
invention clearly indicated that the test sequence was 
a member of set 1 and was not a member of set 2, there being 
7 mismatches between set 2 and the test sequence. 
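
The two scores reported above can be reproduced from the mismatch counts alone. The following is a minimal sketch, assuming that each mismatched position contributes a local probability of p**n (consistent with the 3.13% value used in Example 1), that each matching position contributes 1, and that the global score is the product of the local values. The function name `score_from_mismatch_count` is hypothetical and used only for illustration.

```python
def score_from_mismatch_count(p: float, n: int, mismatches: int) -> float:
    """Global score when every mismatched position contributes p**n
    and every matching position contributes 1 (assumed scoring rule)."""
    return (p ** n) ** mismatches


# Set 1: three basis sequences, no mismatches with the test sequence.
score_set1 = score_from_mismatch_count(p=0.5, n=3, mismatches=0)

# Set 2: four basis sequences, seven mismatched positions.
score_set2 = score_from_mismatch_count(p=0.5, n=4, mismatches=7)

print(f"set 1: {score_set1:.0%}")   # 100%
print(f"set 2: {score_set2:.2e}")   # 3.73e-09, i.e. essentially 0%
```

Under these assumptions the set 2 score is 2^-28, which is why it is reported as essentially 0%.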

All publications and patent applications are herein 
incorporated by reference to the same extent as if each 
individual publication or patent application was specifically 
and individually indicated to be incorporated by reference. 

Although only a few embodiments have been described in 
detail above, those having ordinary skill in the art will 
clearly understand that many modifications are possible in the 
preferred embodiment without departing from the teachings 
thereof. All such modifications are intended to be encompassed 
within the following claims. 

WE CLAIM: 

1. A method for determining a similarity score for a test 
sequence and a basis set of sequences comprising the steps of: 

(a) providing a test sequence and a basis set of 
sequences such that the test sequence and the basis set of 
sequences are aligned; 

(b) determining the identity of a monomer unit at a 
position m in the test sequence; 

(c) assigning a value of 1 to a local matching 
probability x_m if the monomer unit at position m in the 
test sequence matches any members of the basis set at 
position m, or, assigning a value of between 0 and 1 to 
a local matching probability x_m if the monomer unit at 
position m in the test sequence does not match any members 
of the basis set at position m. 

2. The method of claim 1 wherein if the monomer unit at 
position m in the test sequence does not match any members of 
the basis set at position m, x_m is assigned a value of 

$$p^n$$

where p is a number between 0 and 1 and n is the number of 
sequences in the basis set at position m. 



3. The method of claim 1 wherein p is between 0.4 and 0.6. 

4. The method of claim 1 wherein p is 0.5. 

5. The method of claim 1 further comprising the steps of: 

performing steps (b) and (c) at a plurality of 
positions m in the test sequence thereby determining a 
plurality of local matching probabilities x_m; and 


determining a global matching probability for the 
basis set and the test sequence, X_G, by forming a 
product of the plurality of x_m. 

6. The method of claim 5 wherein the global matching 
probability for the basis set and the test sequence, X_G, is 
determined by computing the product 

$$X_G = \prod_i x_i$$



7. The method of claim 1 wherein the test sequence is a 
16S RNA sequence from a microorganism, and the basis set 
comprises a plurality of 16S RNA sequences derived from a 
collection of microorganisms. 



8. The method of claim 1 further comprising: 

performing steps (a)-(c) on each of a plurality of 
test sequences; and 

determining a statistical measure of a combined value 
of the local matching probabilities selected from the 
group consisting of an average value, a standard 
deviation, a maximum value, and a minimum value. 

9. A method for determining a similarity score for a test 
sequence and a basis set of sequences comprising the steps of: 

(a) providing a test sequence and a basis set of 
sequences wherein the test sequence and the basis set of 
sequences are aligned; 

(b) determining the identity of a monomer unit at a 
position m in the test sequence; 

(c) assigning a value of 1 to a local matching 
probability x_m if the monomer unit at position m in the 
test sequence matches any members of the basis set at 
position m, or, assigning a value of 

$$p^n$$

to the local matching probability x_m if the monomer unit 
at position m in the test sequence is not present in any 
members of the basis set at position m, where p is a 
number between 0 and 1 and n is the number of sequences in 
the basis set at position m; 

(d) changing the value of p and repeating step (c); 
and 

(e) determining a range of values of p corresponding 
to the maximum value of x_m. 

10. A program storage device readable by a machine, 
tangibly embodying a program of instructions executable by a 
machine to perform method steps to quantify differences between 
closely related aligned biopolymer sequences, said method steps 
comprising: 

(a) receiving a signal representing a test sequence; 

(b) determining the identity of a monomer unit at a 
position m in the test sequence; and 

(c) assigning a value of 1 to a local matching 
probability x_m if the monomer unit at position m in the 
test sequence matches any members of the basis set at 
position m, or, assigning a value of between 0 and 1 to a 
local matching probability x_m if the monomer unit at 
position m in the test sequence does not match any members 
of the basis set at position m. 

11. The program storage device of claim 10 wherein if 
the monomer unit at position m in the test sequence does not 
match any members of the basis set at position m, x_m is 
assigned a value of 

$$p^n$$

where p is a number between 0 and 1 and n is the number of 
sequences in the basis set at position m. 

12. The program storage device of claim 10 further 
comprising the steps of: 

performing steps (b) and (c) at a plurality of 
positions m in the test sequence thereby determining a 
plurality of local matching probabilities x_m; and 

determining a global matching probability for the 
basis set and the test sequence, X_G, by forming a product 
of the plurality of x_m. 






13. The program storage device of claim 12 wherein the 
global matching probability for the basis set and the test 
sequence, X_G, is determined by computing the product 

$$X_G = \prod_i x_i$$


14. The program storage device of claim 10 wherein the 
test sequence is a 16S RNA sequence from a microorganism, and 
the basis set comprises a plurality of 16S RNA sequences 
derived from a collection of microorganisms. 

15. A computer system for determining a similarity score 
for a test sequence and a basis set of sequences comprising: 

an input device for inputting a test sequence and a 
basis set of sequences such that the test sequence and the 
basis set of sequences are aligned; 

a memory for storing the test sequence and basis set; 

a processing unit configured for: 

determining the identity of a monomer unit at a 
position m in the test sequence; and 

assigning a value of 1 to a local matching 
probability x_m if the monomer unit at position m in the 
test sequence matches any members of the basis set at 
position m, or, assigning a value of between 0 and 1 to 
a local matching probability x_m if the monomer unit at 
position m in the test sequence does not match any members 
of the basis set at position m. 
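
The scoring steps recited in the claims above (a local matching probability per aligned position, combined by multiplication into a global matching probability) can be illustrated with a short sketch. This is not the applicants' implementation. It assumes exact-match comparison only, equal-length pre-aligned sequences, 0-based positions, and a mismatch value of p**n, where n is the number of basis sequences. The function names and the toy sequences are hypothetical.

```python
from math import prod
from typing import Sequence


def local_matching_probability(
    test: str, basis: Sequence[str], m: int, p: float = 0.5
) -> float:
    """Return x_m for 0-based position m.

    x_m is 1 if the test monomer matches any basis member at m;
    otherwise it is p**n (assumed form), n being the basis set size.
    """
    column = [seq[m] for seq in basis]
    if test[m] in column:
        return 1.0
    return p ** len(column)


def global_matching_probability(
    test: str, basis: Sequence[str], p: float = 0.5
) -> float:
    """Return X_G, the product of x_m over every aligned position."""
    if any(len(seq) != len(test) for seq in basis):
        raise ValueError("sequences must be aligned to the same length")
    return prod(
        local_matching_probability(test, basis, m, p)
        for m in range(len(test))
    )


# Toy data (invented for illustration, not taken from the figures).
basis = ["GATTACA", "GATTACA", "GACTACA"]
member = "GACTACA"     # matches some basis member at every position
outsider = "GAGTACC"   # mismatches every basis member at two positions

print(global_matching_probability(member, basis))    # 1.0
print(global_matching_probability(outsider, basis))  # 0.015625
```

With three basis sequences and p = 0.5, each mismatched position contributes 0.125, so two mismatches give 0.125 squared, or 0.015625. Polymorphic basis positions (such as the third column of the toy basis set) do not penalize a test sequence that matches any one member.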



WO 98/20433 



PCT/US97/19491 



Fig. 1 (flowchart): Test Sequence and Basis Set; Align Test 
Sequence and Basis Set; Identify Monomer Unit m in Test 
Sequence; Determine the Local Matching Probability at 
position m, x_m; Report x_m. 



Fig. 2 (flowchart): Test Sequence and Basis Set; Align Test 
Sequence and Basis Set; Identify Monomer Unit m in Test 
Sequence; Determine the Local Matching Probability at 
position m, x_m; Increment m to next desired monomer unit 
(branch N); Determine the Global Matching Probability X_G; 
Report X_G. 



Fig. 3 (flowchart): Test Sequence and Basis Set; Align Test 
Sequence and Basis Set; Set m=1; Determine the Local Matching 
Probability at position m; Index m = m + 1; Compute Product 
$X_G = \prod_{m=1}^{M} x_m$ (branch Yes); Report X_G. 



Fig. 4 (block diagram; labels: Data Bus, Processor, Memory, 
Address, I/O Device). 






Basis Set 



m=7 m=9 



GAGATG AR WATTGTGCCTTCG 
GAGATGAAWAKTGTGCCTTCG 
GAGATGA RWAKTGTGCCTTCG 
GAGATGAAWADTGTGCCTTCG 
GAGATGA A WAKTGTGCCTTCG 



m=26 

GGAACYGTGA 
GGAACYGTGA 
GGAAC YGTGA 
GGAACYGTGA 
GGAACYGTGA 



Test Sequence 



GAGATGGATT GGTGCCTTCG GGAACTGTGA 



Fig. 5 